Abstract

Admixture mapping is a popular tool to identify regions of the genome associated with 
traits in a recently admixed population. Existing methods have been developed primarily for 
identification of a single locus influencing a dichotomous trait within a case-control study 
design. We propose a generalized admixture mapping (GLEAM) approach, a flexible and 
powerful regression method for both quantitative and qualitative traits, which is able to test 
for association between the trait and local ancestries in multiple loci simultaneously and 
adjust for covariates. The new method is based on the generalized linear model and utilizes 
a quadratic normal moment prior to incorporate admixture prior information. Through 
simulation, we demonstrate that GLEAM achieves a lower type I error rate and higher power
than existing methods for qualitative traits and, more markedly, for quantitative
traits. We applied GLEAM to genome-wide SNP data from the Illumina African American
panel, derived from a cohort of Black women participating in the Healthy Pregnancy, Healthy
Baby study, and identified a locus on chromosome 2 associated with the averaged maternal
mean arterial pressure during 24 to 28 weeks of pregnancy.
Introduction

Admixture mapping, also known as mapping by admixture linkage disequilibrium (MALD),
has become an important tool for localizing disease genes. A number of admixture mapping
studies, focused primarily on African American populations, have successfully identified
candidate loci associated with common complex traits and biomarkers. Examples include
hypertension, multiple sclerosis, cardiovascular disease, prostate cancer, interleukin
6 levels, end-stage renal disease, white blood cell counts, blood lipid levels, obesity,
retinal vascular caliber, peripheral arterial disease, blood pressure and acute
lymphoblastic leukemia. Among these newly found susceptibility loci, the association between
end-stage renal disease and the region harboring the MYH9 gene has been reported by multiple
independent studies. The 8q24 prostate cancer locus has been confirmed by a series of
follow-up admixture mapping and genome-wide association studies (GWAS), and the
locus on 5p13 contributing to inter-individual blood pressure variation has been verified by
multiple large-scale GWAS.

Admixture mapping is a genome-wide association approach to identify susceptibility loci
that confer risk, or are linked with other loci harboring risk variants, for complex traits that
differ in prevalence between ancestral populations. In recently admixed populations,
such as African Americans or Hispanic Americans, each chromosome resembles a mosaic
of ancestry blocks, with alleles inherited together from one ancestral population within each
block. The ancestral populations have different risks for the trait, which is assumed to be due
in part to frequency differences in risk variants. The ancestry block containing the risk
variant is therefore more likely to have originated from the high risk ancestral population than from the
low risk ancestral population. Hence, detecting the association between ancestry block and
trait helps us to localize the susceptibility loci. The ancestral status of a block at a specific
genomic region, or local ancestry, is unobserved and can be estimated based on ancestry
informative markers (AIMs), such as single nucleotide polymorphisms (SNPs), which vary in
frequency across ancestral populations. AIMs tag the status of an ancestry block, similar to
tagSNPs, which are used to characterize common haplotypes in a chromosomal region.
In the African American population, the linkage disequilibrium due to admixture extends over
a much wider region than the linkage disequilibrium between haplotypes, which is also
illustrated in Figure 1. Hence, compared to tagSNP-based GWAS, admixture mapping
requires many fewer markers to tag the whole genome and therefore increases the detection
power at a reduced resolution, which is still higher than that of linkage analysis. Moreover,
admixture mapping is less vulnerable to allelic heterogeneity, since it relies on local
ancestry instead of alleles directly.

[Figure 1 about here.] 

Given the local ancestries of each individual, several hypothesis testing-based
approaches have been proposed to test, one locus at a time, the null hypothesis that the
AIM is unlinked to the complex trait or disease. McKeigue first applied the
transmission-disequilibrium test to explore the excess transmission of a risk variant from the high risk
ancestral population at an AIM locus, and later proposed a test for gametic disequilibrium
between an AIM locus and the trait locus, conditional on the parental admixture. Patterson
et al. suggested a Bayesian likelihood ratio test, comparing the likelihood under the
alternative hypothesis (a given AIM locus is associated with the trait) versus the one under the
null hypothesis, for cases and controls respectively. Zhu et al. described a Z-score statistic,
similar to one proposed by Montana and Pritchard, for testing whether the estimated local ancestry
proportion is equal to one under the null hypothesis for case-control and case-only studies.

Although considerable research has been devoted to single locus admixture mapping
focused on dichotomous traits, less attention has been paid to admixture mapping for
quantitative traits and to considering multiple loci simultaneously while adjusting for other risk
factors. Quantitative traits have been the focus of many admixture mapping studies, such
as interleukin 6 levels as inflammatory biomarkers for cardiovascular disease risk,
ankle-arm index for peripheral arterial disease, central retinal artery equivalent level for retinal
vascular caliber, and white cell count for acute inflammation. To apply existing admixture
methods, the common practice has been to dichotomize the trait, treating subjects with the lowest and highest
q% (e.g. 20%) of the quantitative trait values as cases and controls. The remaining subjects,
with in-between quantitative trait values, are discarded, resulting in reduced power. In
addition, complex traits are commonly caused by joint effects of multiple genes and other
risk factors, such as age, sex and smoking status. Investigating the association between AIM
loci and a trait one locus at a time, without considering other loci or risk factors, may capture only a
small proportion of the joint effects and may lead to inconsistent conclusions.

With these motivations we propose regression-based generalized admixture mapping (GLEAM)
for both quantitative and qualitative traits, with the ability to examine the association
between the complex trait and single or multiple loci simultaneously while also adjusting for
other risk factors. The new approach is based on generalized linear models (GLMs), with
linear regression for continuous traits, logistic regression for binary (e.g. case-control) traits
and Poisson regression for count traits. The predictors in the GLM include local ancestries at the
given AIM loci and other risk factors. The local ancestry is defined as the number of alleles
from the high risk ancestral population, for example, 0, 1 or 2 alleles of African ancestry
at a given AIM locus. A related approach has been considered by Hoggart et al. for a single locus without
adjustment for other factors. We assume for complex genetic traits that most loci have no
association with the trait, a few loci may have small to modest association (e.g. odds ratio
< 2 for binary traits), and the loci with higher proportions of disease-causing alleles from
the high-risk population would possibly have stronger association with the traits. This



Generalized Admixture Mapping for Complex Traits 5 

prior knowledge is incorporated into GLEAM by using a quadratic normal moment (QNM)
prior for the coefficients in the GLM (see the "Material and Methods" section for details),
with the benefit of reducing the type I error while increasing the power, as demonstrated by
the simulations in the "Results" section.

The number of AIMs (1500 to 3000) is usually larger than the number of study
subjects, and keeps increasing (> 4000) with advances due to the HapMap project
and commercially available genome-wide SNP arrays. It is not feasible to consider all loci
simultaneously due to the "curse of dimensionality". Rather, we propose a two-stage
approach: in the first stage, we examine the association between local ancestries and the
trait for one locus at a time and select a small subset of susceptibility loci; in the second
stage, the associations between the various combinations of these selected loci and the trait
are evaluated and the most significant ones are reported. The associations in both stages are
assessed by the Bayes factor (BF), the ratio between the likelihood of observed traits under
the alternative hypothesis (presence of association between single or multiple loci and the traits)
and that under the null hypothesis (lack of association).

The local ancestries are unobserved and will be inferred based on the AIMs using a
hidden Markov model (HMM), with the focus on two-population admixture, similar to
that of Falush et al. and Patterson et al., with one key difference: the recombination
process is modeled non-parametrically. At each AIM locus, the number of alleles from the
high risk ancestral population will be imputed multiple times for every subject, using a
Markov chain Monte Carlo (MCMC) algorithm. Existing approaches only record the imputed
frequency of the number of alleles from the high risk ancestral population, individually or
across the population, without accounting for imputation uncertainty. In contrast, our
approach imputes multiple datasets of local ancestries, from which we are able to assess
the association between the traits and local ancestries directly, while taking imputation
uncertainty into account through Bayesian averaging. Importantly, the admixture linkage
disequilibrium between the AIM loci is preserved in our multiple imputation approach, which
is crucial for multilocus admixture mapping.

The remainder of the paper is organized as follows. In Material and Methods, we first present
the HMM for imputing the local ancestries, followed by the specification of the generalized
linear model for quantitative and qualitative traits with the QNM prior density. In Results,
we show through simulations that the new approach increases the power of admixture
mapping while reducing the type I error rates compared to the popular method of Patterson
et al. The new approach is then applied to data from a large cohort study, the Healthy
Pregnancy, Healthy Baby (HPHB) Study, and further extensions are considered in the Discussion
section.



Material and Methods 



Hidden Markov Model 

For a population-based design, suppose we have $I$ unrelated subjects, each of which has the
same set of $J$ AIMs recorded. The local ancestry is measured by $S_{ij} \in \{0, 1, 2\}$, the number
of alleles from the high risk population A (e.g. African) for the $i$th subject and the $j$th AIM.
$S_{ij}$ is unknown and will be imputed using the HMM. For African Americans with African
and European ancestral populations, the HMM assumes that, given $S_{ij}$, the distribution of
$X_{ij} \in \{0, 1, 2\}$, the number of variant alleles, is independent of the other $S_{ij'}$ and $X_{ij'}$ with
$j' \neq j$ and is specified by the observation probability mass matrix $P_j = \{p_j(m,n)\}_{3 \times 3}$ with

$$p_j(m, n) = \mathrm{Prob}(X_{ij} = n \mid S_{ij} = m),$$

whose entries are determined by $p_j^A$, the minor allele probability at locus $j$ in the high risk population A, and $p_j^B$, the
corresponding probability in the low risk population B.
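
As an illustration of the observation model, the sketch below builds a 3 x 3 matrix of Prob(X = n | S = m) under the additional assumption that the two alleles at a locus carry the variant independently given their ancestries (Hardy-Weinberg proportions within each ancestral population). The function name `emission_matrix` and the example frequencies are hypothetical and not taken from the paper; the paper's own matrix may be parameterized differently.

```python
def emission_matrix(p_a, p_b):
    """Return rows m = 0, 1, 2 (alleles from population A) and
    columns n = 0, 1, 2 (variant alleles) of Prob(X = n | S = m).

    Assumption (not from the source): the two alleles are independent
    given their ancestries, each carrying the variant with probability
    p_a (ancestry A) or p_b (ancestry B).
    """
    def pair(p1, p2):
        # Distribution of the variant count for two independent alleles.
        return [(1 - p1) * (1 - p2),
                p1 * (1 - p2) + p2 * (1 - p1),
                p1 * p2]

    return [pair(p_b, p_b),   # S = 0: both alleles from B
            pair(p_a, p_b),   # S = 1: one allele from each population
            pair(p_a, p_a)]   # S = 2: both alleles from A


P = emission_matrix(p_a=0.7, p_b=0.1)   # illustrative frequencies
for row in P:
    print([round(v, 4) for v in row])
```

Each row is a probability distribution over the observed genotype, so the more the two frequencies differ, the more informative the marker is about ancestry.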

The latent states $S_i = \{S_{ij}\}_{1 \times J}$, tagging the status of the ancestry blocks, are unobserved
and modeled by a Markov chain which accounts for genetic recombination events. Let $\rho_i$
denote the genome-wide proportion of alleles from the high risk population A for subject
$i$, $Q_{i0} = [(1 - \rho_i)^2, 2\rho_i(1 - \rho_i), \rho_i^2]'$ the initial state vector, $R_{ij} \in \{0, 1, 2\}$ the number of
recombination events between AIM loci $j - 1$ and $j$, and $Q_i^{(r)} = \{q_i^{(r)}(m,n)\}_{3 \times 3}$ the conditional
state transition matrix given $r$ recombination events between the neighboring AIM loci, with
$q_i^{(r)}(m, n) = \mathrm{Prob}(S_{ij} = n \mid S_{i(j-1)} = m, R_{ij} = r)$. The Markov chain $S_i$ is governed by the
state transition matrix $Q_{ij} = \{q_{ij}(m,n)\}_{3 \times 3}$ with $q_{ij}(m,n) = \mathrm{Prob}(S_{ij} = n \mid S_{i(j-1)} = m)$, so that

$$Q_{ij} = \sum_{r=0}^{2} Q_i^{(r)} \, \mathrm{Prob}(R_{ij} = r),$$

where $Q_i^{(0)}$, $Q_i^{(1)}$ and $Q_i^{(2)}$ are the conditional transition matrices for zero, one and two
recombination events, and $R_{ij} \sim \mathrm{Bin}(2, \gamma_j)$, a binomial distribution with $\gamma_j$ the probability that a recombination
event occurs between the neighboring AIM loci in a single chromosome. Consequently, we
can get

$$Q_{ij} = (1 - \gamma_j)^2 Q_i^{(0)} + 2\gamma_j(1 - \gamma_j) Q_i^{(1)} + \gamma_j^2 Q_i^{(2)}.$$
We further specify informative prior distributions for the parameters $p_j^A$, $p_j^B$, $\gamma_j$ and $\rho_i$
involved in the HMM. Although $p_j^A$ of the high risk population A is unknown, we have
information on $p_{0j}^A$, the proportion of the variant allele $j$ in a subpopulation of the high risk
population A (e.g. YRI for African), from the HapMap or 1000 Genomes projects. Hence,
we expect that $p_j^A$ would be close to $p_{0j}^A$ and specify $p_j^A \sim \mathrm{Beta}(\tau_A p_{0j}^A, \tau_A(1 - p_{0j}^A))$ with
the expectation $E(p_j^A) = p_{0j}^A$ and $\tau_A \sim U[50, 1000]$, a uniform distribution to reflect the
uncertainty in borrowing the subpopulation information. A similar specification is chosen
for $p_j^B$ based on the proportion of the variant allele $j$ in a subpopulation of the low risk
population B (e.g. CEU for European). As for $\gamma_j$, it is well known that the recombination
probability is roughly proportional to $d_j$, the genetic distance between the $(j-1)$th and $j$th
AIM loci. A common choice is $\gamma_j = 1 - \exp(-\lambda d_j)$ with $\lambda = 6$ the number of recombination
events per Morgan since admixture. However, recombination 'hotspots' can occur along
the chromosomes where the recombination probabilities are much higher than in other
regions. For this reason, we avoid the above parametric specification of $\gamma_j$. Instead,
we let $\gamma_j \sim \mathrm{Beta}(\tau_\gamma \gamma_{0j}, \tau_\gamma(1 - \gamma_{0j}))$ with the expectation $E(\gamma_j) = \gamma_{0j} = 1 - \exp(-\lambda d_j)$.
Hence, on average the probability of recombination is proportional to the genetic distance
while allowing significant deviation (e.g. 'hotspots') from the average. The deviation is
governed by $\tau_\gamma$, with $\mathrm{Var}(\gamma_j) = \gamma_{0j}(1 - \gamma_{0j})/(\tau_\gamma + 1)$. Additionally, for the admixed population,
we often have knowledge about the proportions of ancestral populations at the population
level. For example, the African American population in general consists of 80% African
ancestral population and 20% European ancestral population. We borrow this
population level information to specify $\rho_i$, the subject specific proportion of the high risk population
A, by letting $\rho_i \sim \mathrm{Beta}(\tau_\rho \rho_{0i}, \tau_\rho(1 - \rho_{0i}))$ with expectation $\rho_{0i}$ (e.g. 0.8 for African Americans) and
$\mathrm{Var}(\rho_i) = \rho_{0i}(1 - \rho_{0i})/(\tau_\rho + 1)$.
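
The mean-and-concentration form of these Beta priors can be sketched as follows. The helper name `beta_from_mean`, the genetic distance and the two concentration values are illustrative assumptions; only the rate of 6 recombination events per Morgan and the prior mean of 0.8 come from the text. For a Beta distribution with shape parameters tau*mu and tau*(1 - mu), the mean is mu and the variance is mu*(1 - mu)/(tau + 1), so a larger concentration pulls the draws closer to the prior mean.

```python
import math
import random


def beta_from_mean(mu, tau):
    """Shape parameters (a, b) of Beta(tau*mu, tau*(1-mu)), together
    with its mean and variance computed from the standard formulas."""
    a, b = tau * mu, tau * (1.0 - mu)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return a, b, mean, var


lam = 6.0        # recombination events per Morgan since admixture (from the text)
d_j = 0.02       # hypothetical genetic distance in Morgans
gamma0 = 1.0 - math.exp(-lam * d_j)   # prior mean of the recombination probability

tau_gamma = 100.0                     # hypothetical concentration
a, b, mean, var = beta_from_mean(gamma0, tau_gamma)

rng = random.Random(1)
gamma_draws = [rng.betavariate(a, b) for _ in range(5)]   # scatter around gamma0

# Subject-level admixture proportion with prior mean 0.8 (from the text)
# and a hypothetical concentration of 50.
a_rho, b_rho, mean_rho, var_rho = beta_from_mean(0.8, 50.0)
print(round(gamma0, 4), round(var, 6), round(mean_rho, 2), round(var_rho, 6))
```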

We use an MCMC algorithm to sample the local ancestries $S_i$ for $i = 1, 2, \ldots, I$, along
with the other parameters. The details of the MCMC algorithm are given in the Appendix.
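
The transition structure of the Markov chain described in this subsection can be sketched as a mixture over the number of recombination events between neighboring loci. The three conditional matrices below are assumptions made for illustration: with no recombination the state is unchanged, and each recombining chromosome is assumed to draw a fresh ancestry from population A with probability equal to the subject's genome-wide admixture proportion. The function names are hypothetical, and the paper's own matrices may differ.

```python
def conditional_matrices(rho):
    """Assumed transition matrices of the ancestry count (0, 1, 2)
    given r = 0, 1, 2 recombination events across the chromosome pair."""
    q0 = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]                      # r = 0: state unchanged
    q1 = [[1 - rho, rho, 0.0],
          [(1 - rho) / 2, 0.5, rho / 2],
          [0.0, 1 - rho, rho]]                  # r = 1: one chromosome redrawn
    init = [(1 - rho) ** 2, 2 * rho * (1 - rho), rho ** 2]
    q2 = [init[:], init[:], init[:]]            # r = 2: both chromosomes redrawn
    return q0, q1, q2


def transition_matrix(rho, gamma):
    """Mix the conditional matrices with Binomial(2, gamma) weights."""
    weights = [(1 - gamma) ** 2, 2 * gamma * (1 - gamma), gamma ** 2]
    mats = conditional_matrices(rho)
    return [[sum(w * m[i][j] for w, m in zip(weights, mats))
             for j in range(3)] for i in range(3)]


Q = transition_matrix(rho=0.8, gamma=0.1)   # illustrative values
for row in Q:
    print([round(v, 4) for v in row])
```

Under these assumptions the initial state vector is stationary for the chain, so the marginal distribution of local ancestry is the same at every locus while neighboring loci remain correlated.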

Generalized linear model with QNM prior 

GLEAM is a regression method that extends the current approaches in various ways. The
most obvious extension is to accommodate both quantitative and qualitative traits $Y_i$ through
a generalized linear model with the ability to adjust for covariates $E_i = (E_{i1}, E_{i2}, \ldots, E_{iq})'$.
Specifically, we use the linear model for continuous traits,

$$Y_i = \beta_0 + \beta' S_i + \alpha' E_i + \epsilon_i \qquad (1)$$

and the logistic model for dichotomous traits,

$$\mathrm{logit}\{\mathrm{Prob}(Y_i = 1)\} = \beta_0 + \beta' S_i + \alpha' E_i \qquad (2)$$

where $p$ local ancestries $S_i = (S_{i1}, S_{i2}, \ldots, S_{ip})'$ are considered and centered to have mean
zero, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)'$ and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_q)'$ are the regression coefficients for $S_i$
and $E_i$ respectively, and $\epsilon_i \sim N(0, \sigma^2)$. We use the Bayes factor to assess the admixture
association between local ancestries and the trait of interest. The Bayes factor is the ratio
between the likelihood of observing the trait under the alternative hypothesis
$H_1: \beta_1 \neq 0, \beta_2 \neq 0, \ldots, \beta_p \neq 0$ and the likelihood under the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$.



A prior distribution for $\beta$ is needed to calculate the marginal likelihood of the data under
$H_1$, for which we use the QNM prior with the density



where $f_{N_p}(\cdot\,; m, V)$ is the density of the $p$-dimensional multivariate normal distribution with mean
vector $m$ and covariance matrix $V$, and $\tau$ is the dispersion parameter. As shown in the left
panel of Figure 2, given $\sigma^2$ and $\Sigma$, the bigger the $\tau$, the larger the mode and dispersion
of the prior. The QNM prior increases the evidence in favor of both the true null and
true alternative hypotheses, compared to other prior distributions (e.g. intrinsic and Cauchy
priors). Moreover, we specify $\sigma^2 \Sigma$ as the covariance matrix of the (iteratively reweighted) least
squares estimate of $\beta$ in the GLM. This choice not only leads to convenient computation but
also easily incorporates the prior knowledge about the effect of local ancestry on the trait.
For example, when $S_i$ is orthogonal to $E_i$, $\Sigma = (S'S)^{-1}$ with $S = [S_1, S_2, \ldots, S_I]'$ in the
linear model for the continuous trait. As illustrated by the right panel of Figure 2, the QNM
prior with $\Sigma = (S'S)^{-1}$ suggests that for each locus, the higher the proportion of alleles
from the high risk population ($p_a$), on average the larger the risk effect of local ancestry.
Such relationships are frequently observed in admixture mapping. More importantly, when
we investigate multiple loci simultaneously, it is crucial to take the correlation (linkage
disequilibrium, LD) between the local ancestries into consideration. Figure 3 plots several
volcano-shaped bivariate QNM densities for various correlations between two local ancestries.
It is clear that for two loci in admixture linkage equilibrium (as shown in panel (a)), such
as two loci on different chromosomes, their risk effects would be independent; and that for
two loci with high admixture LD (as shown in panel (d)), usually located in the same gene,
they would have similar risk effects.
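
To see why taking the prior scale matrix as the inverse of S'S ties the prior spread to the ancestry proportion, consider a single locus under the simplifying assumption (ours, for illustration) that the two alleles have independent ancestries, each from the high risk population with probability p_a. The centered local ancestry then has variance 2*p_a*(1 - p_a), so S'S is roughly the number of subjects times that variance, and its inverse grows as p_a moves from 0.5 toward 1. The function name and sample size below are hypothetical.

```python
def approx_sigma(p_a, n_subjects):
    """Approximate (S'S)^{-1} for one centered locus, assuming the
    ancestry count is Binomial(2, p_a) so Var(S) = 2*p_a*(1 - p_a)."""
    return 1.0 / (n_subjects * 2.0 * p_a * (1.0 - p_a))


sigmas = {p: approx_sigma(p, n_subjects=1000) for p in (0.6, 0.7, 0.8, 0.9)}
for p, s in sigmas.items():
    print(p, round(s, 6))
```

For proportions above 0.5, a larger p_a gives a larger prior scale and hence a prior that places more mass on larger effects, which matches the relationship described in the text.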

We use the Bayes factor to compare the likelihoods of observed traits under $H_1$ versus
under $H_0$. Intuitively, the Bayes factor is the ratio between the evidences, which combine the
likelihood of the observed traits with the prior probability of association, under $H_1$ and
$H_0$ respectively. The larger the Bayes factor, the stronger the evidence in support
of $H_1$. With the QNM prior for $\beta$ under $H_1$, the Bayes factor can be obtained in a simple
closed form,

where $T$ is computed from $\hat{\beta}$, the maximum likelihood estimate of $\beta$, adjusted for other
risk covariates when necessary, $\hat{\Sigma}$ is the corresponding covariance matrix estimate, and $\hat{\tau}$
and $\hat{\sigma}^2$ are the empirical Bayes estimates. The Bayes factor (3) will be used to identify the loci
associated with the traits, as detailed below.
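
A closed-form Bayes factor of this kind can be sketched under explicit assumptions, which are ours and may differ in detail from expression (3). We assume that the estimate of the coefficient vector is normally distributed around the true value with covariance sigma2 * Sigma, that the moment prior has density (beta' Sigma^{-1} beta) / (p * tau * sigma2) times a normal density with mean zero and covariance tau * sigma2 * Sigma, and that tau and sigma2 are treated as known. Integrating the coefficient out then gives

$$\mathrm{BF} = (1 + \tau)^{-p/2 - 1} \left\{ 1 + \frac{\tau T}{p \sigma^2 (1 + \tau)} \right\} \exp\left\{ \frac{\tau T}{2 \sigma^2 (1 + \tau)} \right\}, \qquad T = \hat{\beta}' \Sigma^{-1} \hat{\beta}.$$

The function name `log10_bf_qnm` is hypothetical.

```python
import math


def log10_bf_qnm(t_stat, p, tau, sigma2):
    """log10 Bayes factor of H1 (moment prior on beta) against H0 (beta = 0).

    Assumptions (ours): beta_hat ~ N(beta, sigma2 * Sigma), the prior is
    beta' Sigma^{-1} beta / (p * tau * sigma2) * N_p(beta; 0, tau * sigma2 * Sigma),
    t_stat = beta_hat' Sigma^{-1} beta_hat, and tau, sigma2 are known.
    """
    shrink = tau / (1.0 + tau)
    log_bf = (-(p / 2.0 + 1.0) * math.log(1.0 + tau)
              + math.log(1.0 + shrink * t_stat / (p * sigma2))
              + shrink * t_stat / (2.0 * sigma2))
    return log_bf / math.log(10.0)


# Single locus (p = 1): the evidence grows with the quadratic statistic T.
for t in (0.0, 4.0, 16.0, 36.0):
    print(t, round(log10_bf_qnm(t, p=1, tau=1.0, sigma2=1.0), 3))
```

When the statistic is zero the Bayes factor is below one, so the prior actively favors the null, and it increases monotonically as the statistic grows.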



[Figure 2 about here.] 



[Figure 3 about here.] 
Generalized admixture mapping procedure

We propose a two-stage approach for GLEAM. In the first stage, we examine the marginal
association between a single AIM locus and the trait using the Bayes factor (3), one locus
at a time for the $J$ AIM loci. The loci at which $\log_{10}\mathrm{BF}(y) > 5$ are considered susceptibility loci.
While the 'one locus at a time' approach explores the marginal association and is widely used,
marginal association only reflects part of the relationship between the AIM loci and the trait.
Several loci in different regions may show associations with the trait. Thus, it is desirable to
quantify the evidence for joint association of multiple loci with the trait. For this reason, in
the second stage, we list all possible combinations of the susceptibility loci selected in the first
stage. For each set of susceptibility loci, we again calculate the Bayes factor for the joint
association at those loci simultaneously. The most significant combinations are reported. The local
ancestries at the AIM loci are unobserved and imputed from the HMM. The imputation
uncertainty is accounted for by calculating a weighted average of the Bayes
factors over the imputed local ancestry datasets, which is similar to the strategy used by Guan
and Stephens in imputation-based association mapping for testing untyped variants.
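
The two-stage procedure can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `bf_fn` is a hypothetical callable that returns one Bayes factor per imputed dataset for a tuple of locus indices, the imputed datasets are weighted equally unless weights are supplied, the threshold is expressed on the Bayes factor scale (100 corresponds to a log10 Bayes factor of 2), and `max_size` is an added cap on the combination size to keep the enumeration small.

```python
from itertools import combinations


def average_bf(bfs, weights=None):
    """Weighted average of Bayes factors over imputed datasets
    (equal weights unless given; the weighting scheme is an assumption)."""
    if weights is None:
        weights = [1.0 / len(bfs)] * len(bfs)
    return sum(w * b for w, b in zip(weights, bfs))


def two_stage(bf_fn, n_loci, threshold, max_size=3):
    """Run a marginal scan, then score combinations of the selected loci."""
    # Stage 1: marginal scan, one locus at a time.
    stage1 = {j: average_bf(bf_fn((j,))) for j in range(n_loci)}
    selected = [j for j, bf in stage1.items() if bf > threshold]
    # Stage 2: joint Bayes factors for combinations of the selected loci.
    stage2 = {}
    for size in range(2, min(max_size, len(selected)) + 1):
        for combo in combinations(selected, size):
            stage2[combo] = average_bf(bf_fn(combo))
    ranked = sorted(stage2.items(), key=lambda kv: kv[1], reverse=True)
    return selected, ranked


# Toy Bayes factors: loci 2 and 5 carry signal in every imputed dataset.
def toy_bf(loci):
    signal = sum(1 for j in loci if j in (2, 5))
    return [10.0 ** (3 * signal) * scale for scale in (0.8, 1.0, 1.2)]


selected, ranked = two_stage(toy_bf, n_loci=8, threshold=100.0)
print(selected, ranked[:1])
```

Averaging on the Bayes factor scale, rather than averaging the imputed ancestries first, keeps each imputed dataset internally consistent, which is what preserves the admixture LD between loci in the joint analysis.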

Simulation Studies 

We carried out simulation studies to assess the performance of GLEAM in terms of type
I error rate and power under various scenarios and compared it with the method based
on the Bayesian likelihood ratio (BLR) by Patterson et al., which is implemented in the
software ANCESTRYMAP (http://genepath.med.harvard.edu/~reich/Software.htm). GLEAM
and ANCESTRYMAP use slightly different HMMs to impute the local ancestries, and
ANCESTRYMAP records the proportion of local ancestries only. Because of these differences,
we assumed the true local ancestries were given and focused on evaluating the ability to
localize susceptibility loci, instead of estimating local ancestries. Our simulations were
based on empirical data of local ancestries for 1001 African Americans from the HPHB
Study, with 1296 AIM loci measured across the genome.

We started by investigating the type I error rates for local ancestries which were
scattered around different regions of the genome and in linkage equilibrium. Under this
scenario, a falsely localized AIM locus would be in a region remote from the true
disease-causing locus, which leads to a false positive finding. We first randomly sampled 1000 AIM
loci with replacement from the 1296 AIM loci for 1000 subjects. At each AIM locus, we simulated
the local ancestries, measured by the number of alleles from the African ancestral population,
from their maximum a posteriori (MAP) frequency estimates under the assumption of
Hardy-Weinberg equilibrium. Ten sets of trait data were then generated such that we were able to
assess the type I error rates under the genome-wide threshold level (e.g. $\alpha = 10^{-4}$), by using
the following null model for continuous traits:

$$Y_i = \alpha E_i + \epsilon_i,$$

and for binary traits,

$$\mathrm{logit}\{\mathrm{Prob}(Y_i = 1)\} = \alpha E_i,$$

where the continuous risk covariate $E_i$ and the measurement error $\epsilon_i$ followed standard
normal distributions. We considered two situations: $\alpha = 0$ in the absence of a
covariate effect and $\alpha = 1$ in the presence of a covariate effect.

We next examined power under the single locus alternative models. We simulated 100 sets
of traits. Each set included 1000 subjects and one disease-associated local ancestry whose
location was randomly sampled from 259 AIM loci, where the proportion of African ancestral
population (PAAP) ranged from 0.8321 to 0.8817 and was in the top 20% among the
1296 AIM loci. Given the local ancestry $S_i$, with the continuous covariate $E_i$ and measurement error
$\epsilon_i$ generated as in the null model, continuous traits were simulated from

$$Y_i = \alpha E_i + \beta S_i + \epsilon_i,$$

and binary traits from

$$\mathrm{logit}\{\mathrm{Prob}(Y_i = 1)\} = \alpha E_i + \beta S_i.$$

Under both models, $\beta$ was specified as $\beta = c \times \mathrm{PAAP}$, which reflected the a priori
observation that a locus with a larger proportion of the high risk ancestral (here African)
population usually demonstrated stronger association with the traits. For
continuous traits, we chose the values of the effect size multiplier $c$ as 0.2, 0.25, 0.3, 0.35 and 0.4
respectively, with the largest possible effect size equal to 0.3527. Similarly, we picked the
values of $c$ as 0.4, 0.5, 0.6, 0.7 and 0.8 for binary traits with the largest possible odds ratio
(OR) equal to 1.8537.
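
The simulation models above can be sketched as follows; setting the ancestry effect to zero recovers the null model. The function name, the seeds and the way local ancestries are generated here (two independent draws per subject, uncentered) are illustrative assumptions, whereas the study itself used empirical local ancestries. For brevity the same covariate draw feeds both trait types. With a multiplier of 0.4 and a PAAP of 0.8817 the effect is about 0.3527, matching the largest continuous effect size quoted in the text.

```python
import math
import random


def simulate_traits(ancestry, alpha, beta, seed=0):
    """Simulate one continuous and one binary trait per subject from
    Y = alpha*E + beta*S + eps and logit P(Y=1) = alpha*E + beta*S,
    with E and eps standard normal. beta = 0 gives the null model."""
    rng = random.Random(seed)
    continuous, binary = [], []
    for s in ancestry:
        e = rng.gauss(0.0, 1.0)                 # continuous risk covariate
        eta = alpha * e + beta * s              # linear predictor
        continuous.append(eta + rng.gauss(0.0, 1.0))
        prob = 1.0 / (1.0 + math.exp(-eta))
        binary.append(1 if rng.random() < prob else 0)
    return continuous, binary


rng = random.Random(42)
paap = 0.8817                                   # largest PAAP quoted in the text
# Hypothetical local ancestries: two independent draws per subject.
ancestry = [sum(rng.random() < paap for _ in range(2)) for _ in range(1000)]

c = 0.4                                         # effect size multiplier
beta = c * paap                                 # beta = c x PAAP
y_cont, y_bin = simulate_traits(ancestry, alpha=1.0, beta=beta, seed=1)
print(round(beta, 4), sum(y_bin), len(y_cont))
```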

We further considered a multilocus alternative model where two local ancestries were
associated with the traits and admixture linkage disequilibrium was present. To do so,
we generated an artificial chromosome composed of two pieces, from chromosome 1 and
chromosome 4, with lengths 139.50 Mb and 114.88 Mb respectively, for 1000 subjects, based
on empirical data on local ancestries from the HPHB study. In the middle of each chromosome
piece with 51 loci, there was one locus whose proportion of African ancestral population was
among the highest of all 1296 AIM loci. In the simulations, those two loci were assumed to be
associated with the traits. We generated 100 sets of continuous and binary traits respectively,
each of which was simulated similarly to the single locus alternative model except with two
local ancestries involved and both effect size multipliers $c$ set at 0.7 for continuous traits
and 0.35 for binary traits.

The simulated datasets were analyzed by GLEAM and the BLR method. Since the BLR
method was primarily developed for binary traits, it required transformation
of continuous traits into binary ones, such as defining the subjects with the top 20% of trait values as
the cases and those with the bottom 20% as controls.
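
The dichotomization step can be sketched as below, which also makes the loss of data explicit. The function name and the tie-breaking rule are assumptions made for illustration.

```python
def dichotomize(values, q=0.20):
    """Label the top q fraction as cases (1), the bottom q fraction as
    controls (0) and everything in between as discarded (None).
    Ties at the cut points are broken by sort order (a simplification)."""
    n = len(values)
    k = int(n * q)                                  # subjects per tail
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for i in order[:k]:
        labels[i] = 0                               # bottom q: controls
    for i in order[n - k:]:
        labels[i] = 1                               # top q: cases
    return labels


trait = [float(v) for v in range(100)]              # toy quantitative trait
labels = dichotomize(trait, q=0.20)
kept = sum(lab is not None for lab in labels)
print(kept, len(trait) - kept)
```

With a 20% cut on each side, 60% of the subjects are discarded before the analysis starts, which is the source of the power loss for continuous traits.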
Results

Simulation Studies 

Figure 4 presents the empirical type I error rates for both the binary and continuous traits,
with or without covariate effects. For the GLEAM and the BLR methods, we chose a
threshold of 2 for $\log_{10}\mathrm{BF}(y)$ to control the genome-wide type I error rates. Under the
null model in which all the local ancestries are in linkage equilibrium, the type I error rate is
controlled at a low level, with the median around $5 \times 10^{-4}$ for GLEAM and $4.2 \times 10^{-3}$ for
the BLR method, as illustrated in Figure 4. In both cases, those type I error rates seem overly
conservative. However, in the application to real data, slight admixture linkage disequilibrium
between the AIM loci will inflate the type I error rate to close to the nominal levels
(i.e. $\alpha = 0.05$ or 0.005), which is discussed in the later paragraphs. Comparing the two panels in
Figure 4 reveals that the type I error rates of GLEAM are consistently smaller than those of
the method based on BLR and are little affected by the presence of covariate effects when
properly adjusted. The covariates are not considered by the BLR method and have a mixed
effect on its type I error rates: the median is slightly reduced while the maximal type I
error rate is increased.

[Figure 4 about here.] 

Power of the methods was also evaluated for binary and continuous traits under the single
locus alternative model, with or without covariate effects. We considered various effect sizes
of local ancestries, with the results shown in Figure 5. For the binary trait, when the effect
size is small, the BLR method has larger power. As the
effect size increases, GLEAM gradually outperforms the BLR method. For both methods, covariates
have moderate effects on power, which are more obvious for the smaller effect sizes. For the
continuous trait, GLEAM performs significantly better at each effect size. These results
were expected since the BLR method discards part of the dataset in order to transform the
continuous trait into a binary one (case versus control), which inevitably loses power. For
all situations considered, the power of the GLEAM approach increases with
the local ancestry effect size, most rapidly when the effect sizes are small, and then levels
off at larger effect sizes. In comparison, the power of the BLR method increases roughly
linearly.

[Figure 5 about here.] 

To understand the impact of admixture linkage disequilibrium on type I error rates and to
evaluate the ability to localize multiple loci simultaneously, we generated a set of artificial
chromosomes as described before, where two loci, named
Locus 1 and Locus 2, were associated with the traits. Besides Locus 1 and Locus 2, we divided the remaining loci into three
regions: region 1 (REG1) with 42 loci and region 2 (REG2) with 35 loci, where the admixture
linkage disequilibrium, measured by the correlation coefficient between a given locus in these
regions and Locus 1 or Locus 2 respectively, was larger than 0.12; and region 3 (REG3),
the unassociated loci which did not belong to region 1 or region 2. Strictly speaking, the
identified loci other than Locus 1 and Locus 2 were all false positives. However, in contrast to
the loci found in region 3, which were completely false findings, the loci identified in region 1
and region 2 were partially correct and could be regarded as low resolution findings instead,
since the true associated locus did exist in the nearby region. Therefore, we evaluated the
false positives in the three regions separately. An ideal method under the pre-specified
genome-wide threshold would lead to few completely false positives in region 3 and to a small number
of partially false positives in regions 1 and 2, while being able to identify the true associated
loci with high frequency.

Table 1 summarizes the frequencies of identified loci for each locus or locus combination
in the different regions by GLEAM and the BLR method. For the GLEAM method, we applied the
two-step approach outlined in the "Generalized admixture mapping procedure" subsection.
The results from applying the first step only (GLEAM1) and from applying the two-step approach
(GLEAM2) are both presented. For binary traits, both the BLR method and GLEAM1
could localize both Locus 1 and Locus 2 with high power. The type I error rates in region 1
were around the nominal level (0.025 and 0.003 respectively). The type I error rates in region
1 and region 2 were higher than the ones in region 3, which would decrease the resolution
of the finding. Compared to GLEAM1, further applying the second step of the generalized
admixture mapping procedure (GLEAM2) could significantly improve the resolution by
reducing the type I error rates in region 1 (from 0.013 to 0.002) and region 2 (from 0.014 to
0.003). For continuous traits, GLEAM2 also performed best, with much higher power and
a lower type I error rate than the BLR method.

[Table 1 about here.] 

Application 

This methodological work was motivated by real data from the Healthy Pregnancy, Healthy 
Baby (HPHB) study, a prospective cohort study of pregnant women aimed at 
identifying genetic, social and environmental contributors to disparities in adverse birth 
outcomes in the US South.^ Consistent with previous studies, African American women 
in HPHB have a higher risk of maternal hypertension during pregnancy than Caucasian 
women, which contributes to poor birth outcomes.^ Even within the African American 
subpopulation, some women have much higher blood pressures, and 
we hypothesize that one possible contributor may be the percentage of African ancestry. 
To explore this hypothesis, we applied GLEAM to investigate the association between the 
averaged maternal mean arterial pressure (MAP), defined as (1/3 x systolic blood pressure) + 
(2/3 x diastolic blood pressure), during 24 to 28 weeks of pregnancy and local ancestries 
among these pregnant African American women. Clinical and genetic data were available for 
1004 non-Hispanic Black (NHB) women. A total of 1509 SNP AIMs were genotyped using the 
Illumina African American admixture panel. After the quality control measures described 
previously,^ the dataset consisted of 1001 NHB women with 1296 AIMs. 
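
As a minimal sketch of the trait definition above, the following Python snippet computes MAP from a single blood pressure reading and averages it over the readings taken in the 24 to 28 week window. The function names and the example readings are hypothetical and are not taken from the HPHB data; averaging the per-reading MAP values is an assumption about how the averaged MAP was formed.

```python
def mean_arterial_pressure(systolic, diastolic):
    """MAP = 1/3 * systolic + 2/3 * diastolic, as defined in the text."""
    return systolic / 3.0 + 2.0 * diastolic / 3.0


def averaged_map(readings):
    """Average MAP over a list of (systolic, diastolic) readings.

    Assumption: the averaged MAP is the plain mean of per-reading MAP values.
    """
    values = [mean_arterial_pressure(s, d) for s, d in readings]
    return sum(values) / len(values)


# Hypothetical readings for one subject between weeks 24 and 28.
example_readings = [(120, 80), (110, 70)]
example_value = averaged_map(example_readings)
```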

The proposed GLEAM approach was applied to this dataset to identify the local ancestry 
associated with the averaged maternal MAP, a continuous trait, while adjusting for mother's 
age. The local ancestries were multiply imputed based on the HMM. We first examined the 
marginal association between the trait and local ancestries, one locus at a time. The results are 
summarized in Figure 6, where one local ancestry on chromosome 2 was identified, with its 
$\log_{10}$(Bayes factor) = 2.05 exceeding the threshold of 2. With only one local ancestry localized, 
the second step of the generalized admixture mapping procedure was unnecessary. The same 
data were analyzed by the BLR method, which treated the subjects with averaged maternal 
MAP above 93.67 (top 20% quantile) as cases and those with averaged maternal 
MAP below 79.33 (bottom 20% quantile) as controls. No local ancestry was identified as 
being associated with the averaged maternal MAP with this approach, presumably due to 
its relatively low power compared with the GLEAM approach. 
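
The dichotomization used for the BLR comparison can be sketched as follows. This is an illustrative Python example, not the code used in the study: the function name `dichotomize_by_quintiles` and the toy trait values are hypothetical, and the quantile rule is that of Python's `statistics.quantiles`, which may differ slightly from the rule used to obtain the cutoffs 79.33 and 93.67.

```python
import statistics


def dichotomize_by_quintiles(values):
    """Label the top 20% as cases, the bottom 20% as controls, the rest as None."""
    cuts = statistics.quantiles(values, n=5)  # 20th, 40th, 60th, 80th percentiles
    low, high = cuts[0], cuts[-1]
    labels = []
    for v in values:
        if v > high:
            labels.append("case")
        elif v < low:
            labels.append("control")
        else:
            labels.append(None)  # middle of the distribution is not used
    return labels, low, high


# Hypothetical averaged MAP values, for illustration only.
toy_map = [70.0 + 0.4 * k for k in range(100)]
labels, low, high = dichotomize_by_quintiles(toy_map)
n_cases = labels.count("case")
n_controls = labels.count("control")
```

Note that this labeling leaves the middle 60% of subjects out of the case-control comparison, whereas a regression on the continuous trait uses every subject.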

[Figure 6 about here.] 

Discussion 

By utilizing admixture linkage disequilibrium, admixture mapping is an indispensable tool for 
localizing alleles associated with qualitative or quantitative traits and diseases 
that vary in prevalence across the ancestral populations. The available methods are most 
suitable for dichotomous traits in a case-control study and do not allow for adjustment for 
other risk covariates. In this article, we propose a flexible and powerful generalized admixture 
mapping approach, which is based on the generalized linear model, incorporates 
admixture prior information through the quadratic normal moment prior, and adjusts for 
covariates. The proposed method is applicable to both qualitative and quantitative traits 
with satisfactory power while controlling the type I error rates at a low level, and can 
be easily implemented, as we demonstrated with our HPHB example. 

In addition to the flexibility to handle different types of traits, other attractive 
generalizations include consideration of multiple loci simultaneously. As illustrated in Figure 1, 
admixture linkage disequilibrium extends much further than haplotype linkage disequilibrium. 
Consequently, if we examine only one locus at a time, the local ancestries that are highly 
correlated with the true disease-associated local ancestry tend to be identified as significant 
as well. As demonstrated by the simulations, those false positives can be significantly 
reduced by considering multiple susceptibility loci simultaneously, which reduces the type I 
error rates and improves the mapping resolution. In addition, GLEAM specifies a hidden 
Markov model in which the recombination rates vary across the genome, which allows 
us to infer the recombination "hotspots" in admixed populations. Moreover, within the 
generalized linear model framework, it is straightforward to extend the current method to 
populations with more than two ancestral populations, such as Hispanic populations, 
by adding extra ancestral population covariates. It is also easy to consider the interaction 
between the local ancestries and covariates with proper specification of the priors on the 
interaction coefficients. 

Acknowledgments 

This work was supported by Award Number R01ES017436 from the National Institute of 
Environmental Health Sciences, and by funding from the National Institutes of Health 
(5P20-RR020782-03) and the U.S. Environmental Protection Agency (RD-83329301-0). 
The content is solely the responsibility of the authors and does not necessarily represent 
the official views of the National Institute of Environmental Health Sciences, the National 
Institutes of Health or the U.S. Environmental Protection Agency. 




Appendices 

MCMC algorithm for HMM 

We propose an MCMC algorithm for posterior computation of the HMM as follows. 

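(1) Impute the missing AIM $X_{ij}^{m}$. Given $P_j$ and $S_{ij}$, $X_{ij}^{m} \in \{0, 1, 2\}$ can be easily 
sampled with probability mass $P_j(S_{ij}, X_{ij}^{m})$. 

(2) Update the latent states $S_i$ for $i = 1, 2, \ldots, I$. Given $Q_{i0}$, $Q_i^{(0)}$, $Q_i^{(1)}$, $Q_i^{(2)}$ and 
$R_i = \{R_{ij}\}_{1 \times J}$, we use the forward filtering backward sampling (FFBS) algorithm^ to sample 
$S_i$ in one block. The FFBS algorithm mixes more rapidly than the direct 
Gibbs sampler, which samples one $S_{ij}$ at a time conditional on the rest of $S_i$. Let 
$X_i^{j} = [X_{i1}, X_{i2}, \ldots, X_{ij}]'$ and $R_i = [R_{i1}, R_{i2}, \ldots, R_{iJ}]'$. We begin the FFBS algorithm 
by calculating $q_{ij} = \{q_{ij}(m, n)\}_{3 \times 3}$ with $q_{ij}(m, n) = \mathrm{Prob}(S_{i(j-1)} = m, S_{ij} = n \mid X_i^{j}, R_i)$ 
recursively for $j = 1, 2, \ldots, J$ as 

$$
\begin{aligned}
q_{ij}(m, n) &= \mathrm{Prob}(S_{i(j-1)} = m, S_{ij} = n \mid X_i^{j}, R_i) \\
&= \frac{\mathrm{Prob}(S_{i(j-1)} = m, S_{ij} = n, X_{ij} \mid X_i^{j-1}, R_i)}{\mathrm{Prob}(X_{ij} \mid X_i^{j-1}, R_i)} \\
&= \frac{q_{i(j-1)}(m)\, Q_i^{(R_{ij})}(m, n)\, P_j(n, X_{ij})}{\mathrm{Prob}(X_{ij} \mid X_i^{j-1}, R_i)},
\end{aligned}
$$

where $q_{i0}(m) = Q_{i0}(m)$, $\mathrm{Prob}(X_{ij} \mid X_i^{j-1}, R_i) = \sum_{m=0}^{2} \sum_{n=0}^{2} \mathrm{Prob}(S_{i(j-1)} = m, S_{ij} = n, X_{ij} \mid X_i^{j-1}, R_i)$, 
and $q_{ij}(n) = \sum_{m=0}^{2} q_{ij}(m, n)$. 

We can then sample $S_i$ backward from $S_{iJ}$ to $S_{i1}$ with 

$$
\mathrm{Prob}(S_i \mid X_i, R_i) = \mathrm{Prob}(S_{iJ} \mid X_i, R_i) \prod_{j=1}^{J-1} \mathrm{Prob}(S_{i(J-j)} \mid S_{i(J-j+1)}, X_i, R_i),
$$

where 

$$
\begin{aligned}
\mathrm{Prob}(S_{iJ} \mid X_i, R_i) &= q_{iJ}(S_{iJ}), \\
\mathrm{Prob}(S_{i(J-j)} \mid S_{i(J-j+1)}, X_i, R_i) &= \mathrm{Prob}(S_{i(J-j)} \mid S_{i(J-j+1)}, X_i^{J-j+1}, R_i) \\
&= \frac{q_{i(J-j+1)}(S_{i(J-j)}, S_{i(J-j+1)})}{q_{i(J-j+1)}(S_{i(J-j+1)})}.
\end{aligned}
$$

The initial state $S_{i0}$ is sampled with $\mathrm{Prob}(S_{i0} \mid S_i, X_i, R_i) = q_{i1}(S_{i0}, S_{i1}) / q_{i1}(S_{i1})$. 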

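(3) Update the recombination count $R_i = \{R_{ij}\}_{1 \times J}$ for $i = 1, 2, \ldots, I$. $R_{ij}$ is sampled with 
full conditional probability mass function 

$$
\mathrm{Prob}(R_{ij} = r \mid S_{i(j-1)} = m, S_{ij} = n, Q_i^{(0)}, Q_i^{(1)}, Q_i^{(2)}, \gamma_j) = \frac{Q_i^{(r)}(m, n) \binom{2}{r} \gamma_j^{r} (1 - \gamma_j)^{2-r}}{\sum_{r'=0}^{2} Q_i^{(r')}(m, n) \binom{2}{r'} \gamma_j^{r'} (1 - \gamma_j)^{2-r'}}.
$$

(4) Update the recombination probability $\gamma_j$ from $\mathrm{Beta}\left(\tau^{\gamma} \gamma_{0j} + \sum_{i=1}^{I} R_{ij},\ \tau^{\gamma} (1 - \gamma_{0j}) + 2I - \sum_{i=1}^{I} R_{ij}\right)$ 
for $j = 1, 2, \ldots, J$. 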

(5) Update the proportion of ancestry from population A, $p_i$, from 

Bin (rfpoi + n$ + ng + n£ + n (2) + 2n (2) , r"(l - p 0i ) + nffi + "So + *4i + + 2n (2) ) , 
where n^/ = Z]/=i = k anci •% = ^ anci = 1) an ^ n -i = J2j=i H^tj = 

I and Rij = 2). 

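(6) Update $Q_i^{(0)}$, $Q_i^{(1)}$, $Q_i^{(2)}$ and $Q_{i0}$ based on the latest $p_i$ for $i = 1, 2, \ldots, I$. 

(7) Update $p_j^{A}$ and $p_j^{B}$ for $j = 1, 2, \ldots, J$. Let $n_{kl} = \sum_{i=1}^{I} 1(S_{ij} = k \text{ and } X_{ij} = l)$, and let $n_{11}^{A}$ 
denote the number of cases in which the allele from population A is the variant allele when $S_{ij} = 1$ and 
$X_{ij} = 1$. $n_{11}^{A}$ is unobserved and can be imputed from $\mathrm{Bin}\left(n_{11},\ \frac{p_j^{A}(1 - p_j^{B})}{p_j^{A}(1 - p_j^{B}) + p_j^{B}(1 - p_j^{A})}\right)$. $p_j^{A}$ is 
then sampled from $\mathrm{Beta}\left(\tau^{A} p_{0j}^{A} + n_{21} + 2n_{22} + n_{11}^{A},\ \tau^{A}(1 - p_{0j}^{A}) + n_{21} + 2n_{20} + n_{11} - n_{11}^{A}\right)$; 
$p_j^{B}$ is sampled from $\mathrm{Beta}\left(\tau^{B} p_{0j}^{B} + n_{01} + 2n_{02} + n_{11} - n_{11}^{A},\ \tau^{B}(1 - p_{0j}^{B}) + n_{01} + 2n_{00} + n_{11}^{A}\right)$. 

(8) Update $P_j$ based on the latest $p_j^{A}$ and $p_j^{B}$ for $j = 1, 2, \ldots, J$. 

(9) Update $\tau^{A}$ and $\tau^{B}$ using random-walk Metropolis-Hastings. For $\tau^{A}$, we propose the 
new $\tau^{A*} = \tau^{A} + \epsilon$, where $\epsilon \sim \mathrm{N}_1(0, \sigma_{\tau^{A}}^{2})$. The posterior distribution of $\tau^{A}$ is 
$f(\tau^{A} \mid p^{A}) \propto \prod_{j=1}^{J} \mathrm{Beta}\left(p_j^{A} \mid \tau^{A} p_{0j}^{A},\ \tau^{A}(1 - p_{0j}^{A})\right) 1(\ldots < \tau^{A} < 1000)$. Then 
$\alpha(\tau^{A}, \tau^{A*}) = \min\left\{ \frac{f(\tau^{A*} \mid p^{A})}{f(\tau^{A} \mid p^{A})},\ 1 \right\}$. We draw $\mu^{A} \sim U[0, 1]$. If $\mu^{A} < \alpha(\tau^{A}, \tau^{A*})$, then $\tau^{A}$ is replaced by 
$\tau^{A*}$; otherwise, $\tau^{A}$ is unchanged. A similar update is conducted for $\tau^{B}$. 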
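
The forward filtering backward sampling update for the latent states in the algorithm above can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the authors' implementation: the function `ffbs` and its argument names are hypothetical, `init` stands in for the initial state distribution, `trans[j]` stands in for the transition matrix between consecutive loci given the recombination count, and `emis[j][n]` stands in for the probability of the observed genotype at the next locus given state `n`.

```python
import random


def ffbs(init, trans, emis, rng=random):
    """Draw one state path S_0, ..., S_J from its joint posterior.

    init[m]        : probability that S_0 = m
    trans[j][m][n] : Prob(S_{j+1} = n | S_j = m), for j = 0, ..., J-1
    emis[j][n]     : likelihood of the observation at locus j+1 given S_{j+1} = n
    """
    states = list(range(len(init)))
    n_steps = len(trans)

    # Forward filtering: pair[j][m][n] = Prob(S_j = m, S_{j+1} = n | obs 1..j+1).
    pair = []
    marg = list(init)  # filtered marginal of the current state
    for j in range(n_steps):
        joint = [[marg[m] * trans[j][m][n] * emis[j][n] for n in states]
                 for m in states]
        total = sum(sum(row) for row in joint)  # predictive prob. of the observation
        joint = [[v / total for v in row] for row in joint]
        pair.append(joint)
        marg = [sum(joint[m][n] for m in states) for n in states]

    # Backward sampling: draw the last state, then each earlier state
    # given its successor, with weights pair[j][m][next state].
    path = [0] * (n_steps + 1)
    path[n_steps] = rng.choices(states, weights=marg)[0]
    for j in range(n_steps - 1, -1, -1):
        nxt = path[j + 1]
        weights = [pair[j][m][nxt] for m in states]
        path[j] = rng.choices(states, weights=weights)[0]
    return path
```

Only the pairwise filtered probabilities are stored, because each backward draw needs just the column of the pairwise matrix that matches the state already sampled at the next locus.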
Figure 1. Heatmap of linkage disequilibrium on chromosome 1 of 1001 African 
Americans. (a) Haplotype linkage disequilibrium, measured by correlation coefficients for 
the number of minor alleles between pairs of loci; (b) admixture linkage disequilibrium, 
measured by correlation coefficients for the local ancestry, i.e. the number of African-ancestry 
alleles, between pairs of loci, which are inferred using the hidden Markov model. 
Figure 2. Univariate quadratic normal moment prior (a) for $\tau = 0.01$ (solid line), $\tau = 0.05$ (dotted line) 
and $\tau = 0.1$ ( ) when $p_a = 0.8$; (b) for $p_a = 0.8$ (solid line), $p_a = 0.9$ (dotted line) and $p_a = 0.99$ ( ) 
when $\tau = 0.01$. In both cases, $\sigma^2 = 1$ and $\Sigma = (S'S)^{-1}$ with $\Pr(S_i = 0) = (1 - p_a)^2$, 
$\Pr(S_i = 1) = 2 p_a (1 - p_a)$ and $\Pr(S_i = 2) = p_a^2$. 
Figure 3. Bivariate quadratic normal moment prior with $\tau\sigma^2 = 0.1$ and $\Sigma = (S'S)^{-1}$, 
where $S = [S_1, S_2]'$, $S_1 = (S_{1,1}, \ldots, S_{1000,1})'$, $S_2 = (S_{1,2}, S_{2,2}, \ldots, S_{1000,2})'$, and $S_{i1} \in \{0, 1, 2\}$ 
and $S_{i2} \in \{0, 1, 2\}$. We introduce correlation between $S_{i1}$ and $S_{i2}$ through the latent 
variables $(Z_{i1}, Z_{i2})$, where $Z_{i1} \sim \mathrm{N}_1(0, 1)$, $Z_{i2} \sim \mathrm{N}_1(0, 1)$ and $\mathrm{Cov}(Z_{i1}, Z_{i2}) = \rho$. For $k = 1, 2$, let $S_{ik} = 0$ 
if $Z_{ik} \le C_0$; $S_{ik} = 2$ if $Z_{ik} > C_1$; and $S_{ik} = 1$ otherwise, with $C_0 = \Phi^{-1}((1 - p_a)^2)$ and 
$C_1 = \Phi^{-1}(1 - p_a^2)$, where $\Phi^{-1}$ denotes the inverse normal cumulative distribution function 
(CDF). We consider four scenarios when $p_a = 0.8$: (a) $\rho = 0$; (b) $\rho = 0.25$; (c) $\rho = 0.5$; (d) 
$\rho = 0.75$, with contours drawn beneath the PDF's surface. 
Figure 4. The type I error rates under the null model (note the different scaling of the 
Y-axis for panels a and b). The type I error rates are presented for both binary and 
continuous traits, with or without covariate effect. For each simulated dataset, 
we calculate one type I error rate under the genome-wide threshold level of 2 for both methods. 
The results for 100 replications are summarized by the boxplots, where the center bar is the 
median, the bottom and top of the box are the 25th and 75th percentiles, and the whiskers 
extend to the extreme values. 
[Panel titles: (a) Binary traits without covariate effect; (b) Binary traits with covariate effect; (c) Continuous traits without covariate effect.] 
Figure 5. Power for single-locus alternative models. Power is calculated for each dataset, 
with 100 replications in total, for the binary or continuous traits simulated under the single-locus 
alternative model with or without covariate effect. The x indicates the median power of 
GLEAM and • denotes the median power of the method based on the Bayesian likelihood 
ratio. The whiskers on each bar represent the minimum and maximum power, respectively. The 
effect sizes of local ancestries are equal to the product of the effect size multiplier c and 
the proportion of African ancestry. 
Figure 6. Manhattan plot of $\log_{10}$(Bayes factor) for the association between the averaged 
maternal mean arterial pressure (MAP) during 24 to 28 weeks of pregnancy and genome-wide 
local ancestries among 1001 African Americans. 
Table 1 

The frequency of identified loci for each locus or locus combination at different regions of the artificial chromosome. 

Trait Method REG1 REG2 REG3 Locus1 Locus2 Locus1/2^a 
Binary BLR 0.103 0.047 0.025 1.000 
Binary GLEAM1^b 0.013 0.014 0.003 0.020 0.020 0.960 
Binary GLEAM2^c 0.002 0.003 0.001 0.030 0.030 0.940 
Continuous BLR 0.035 0.018 0.011 0.030 0.400 0.560 
Continuous GLEAM1 0.021 0.017 0.004 0.030 0.970 
Continuous GLEAM2 0.004 0.003 0.002 0.040 0.960 

a: The combination of Locus 1 and Locus 2. 
b: Applying the first step of the generalized admixture mapping procedure only. 
c: Applying both steps of the generalized admixture mapping procedure. 



